For this project, I wanted to learn more about NHL player salary. I explore salary in the context of different determinants, including country, US city, draft year, and team. I also use a random forest regression model to predict salary and identify the most important features in prediction.
Note: see the data source (https://www.kaggle.com/datasets/camnugent/predict-nhl-player-salaries) for a complete list of column names and their abbreviations.
The median salary overall was $925,000. The maximum
salary was $13,800,000, and the minimum was
$575,000. Left-handed players are paid slightly more than
right-handed players, with a 75th percentile salary nearly 13% higher.
However, some of this difference might be attributed to more left-handed
players being present in the dataset. Canada and the US had at least one
player receiving the maximum salary. The US median matches the overall
median, and the Canada median is $950,000. Seeing
relatively high median salaries are the US cities of Madison,
Minneapolis, and St. Paul, each matching or surpassing
$4,000,000. Ann Arbor matches the overall median, and
Detroit just falls short of it.
Among the most important determining features are date of birth
(age), draft year, time on ice divided by games played, team goals while
player on ice, and team shots on goal while player on ice. The random
forest regression model predicts salary with a mean absolute error of
around $1,000,000, which is about 8% of the overall range.
The root mean square error is around $1,600,000, which is
close to 12% of the overall range.
Improved feature engineering and feature selection might potentially increase the accuracy of this model further. Some columns could be split up into separate columns (e.g.: date of birth into year of birth, month, etc.), and some could be eliminated due to similarity (e.g.: CF (team’s shot attempts while player on ice), FF (team’s unblocked attempts while player on ice), SF (team’s shots on goal while player on ice)). An ensemble method might be useful as well. With three models created (one with all columns, one with negative importance columns removed, and one with a hyperparameter-tuned second model), accuracy improved primarily in the ten thousands range. The set of “important features,” as determined by IncNodePurity and %IncMSE charts, remained relatively consistent throughout model improvement.
salary_data <- read.csv("train.csv")
head(salary_data)
## Salary Born City Pr.St Cntry Nat Ht Wt DftYr DftRd Ovrl Hand
## 1 925000 97-01-30 Sainte-Marie QC CAN CAN 74 190 2015 1 18 L
## 2 2250000 93-12-21 Ottawa ON CAN CAN 74 207 2012 1 15 R
## 3 8000000 88-04-16 St. Paul MN USA USA 72 218 2006 1 7 R
## 4 3500000 92-01-07 Ottawa ON CAN CAN 77 220 2010 1 3 R
## 5 1750000 94-03-29 Toronto ON CAN CAN 76 217 2012 1 16 R
## 6 1500000 79-05-23 Strathroy ON CAN CAN 70 192 1997 6 156 L
## Last.Name First.Name Position Team GP G A A1 A2 PTS X... E... PIM Shifts
## 1 Chabot Thomas D OTT 1 0 0 0 0 0 -2 0.0 0 13
## 2 Ceci Cody D OTT 79 2 15 6 9 17 -11 -10.4 20 2418
## 3 Okposo Kyle RW BUF 65 19 26 13 13 45 -7 -1.4 24 1443
## 4 Gudbranson Erik D VAN 30 1 5 5 0 6 -14 -5.3 18 765
## 5 Wilson Tom RW WSH 82 7 12 4 8 19 9 4.1 133 1453
## 6 Campbell Brian D CHI 80 5 12 6 6 17 12 0.7 24 1896
## TOI TOIX TOI.GP TOI.GP.1 TOI. IPP. SH. SV. PDO F.60 A.60 Pct. Diff
## 1 429 7.2 7.15 7.16 15.2 0.0 0.0 0.750 750 0.00 16.74 0.0 -2
## 2 109992 1826.2 23.20 23.17 39.0 30.4 7.4 0.915 989 1.84 2.79 39.7 -29
## 3 73983 1229.2 18.97 18.95 33.1 63.4 9.7 0.934 1031 3.47 1.95 64.0 31
## 4 36603 607.9 20.33 20.31 36.1 37.5 6.2 0.897 959 1.58 3.45 31.4 -19
## 5 63592 1059.7 12.93 12.93 23.5 61.3 7.8 0.917 995 1.76 2.32 43.1 -10
## 6 88462 1473.7 18.43 18.43 32.4 23.9 9.8 0.936 1033 2.89 1.91 60.2 24
## Diff.60 iCF iCF.1 iFF iSF iSF.1 iSF.2 ixG iSCF iRB iRS iDS sDist sDist.1
## 1 -16.74 2 2 2 1 1 1 0.0 0 0 0 0 43.0 49.3
## 2 -0.95 287 287 197 143 143 143 6.1 7 7 9 16 52.4 46.3
## 3 1.51 283 283 212 155 156 156 17.4 64 16 20 36 28.4 26.3
## 4 -1.88 88 88 55 40 40 40 1.4 2 1 4 5 55.1 51.0
## 5 -0.57 166 166 118 95 95 95 9.3 35 8 10 18 30.9 26.4
## 6 0.98 171 171 110 75 74 75 4.5 7 2 3 5 46.1 41.9
## Pass iHF iHF.1 iHA iHDf iMiss iGVA iTKA iBLK iGVA.1 iTKA.1 iBLK.1 BLK. iFOW
## 1 0.0 1 1 0 1 1 1 0 0 1 0 0 0.0 0
## 2 138.1 111 111 154 -43 54 74 22 159 74 22 159 8.0 1
## 3 196.8 53 53 68 -15 57 36 26 25 36 26 25 2.4 54
## 4 153.0 66 66 66 0 15 23 4 44 23 4 44 7.3 0
## 5 96.3 239 239 134 105 23 21 36 44 21 36 44 4.4 3
## 6 95.7 43 43 157 -114 35 59 11 83 59 11 83 6.5 0
## iFOL iFOW.1 iFOL.1 FO. X.FOT dzFOW dzFOL nzFOW nzFOL ozFOW ozFOL FOW.Up
## 1 0 0 0 0.0 0.0 0 0 0 0 0 0 0
## 2 0 1 0 100.0 0.1 1 0 0 0 0 0 0
## 3 45 54 45 54.5 7.4 9 6 10 11 35 28 13
## 4 0 0 0 0.0 0.0 0 0 0 0 0 0 0
## 5 7 3 7 30.0 1.0 1 1 0 2 2 4 3
## 6 0 0 0 0.0 0.0 0 0 0 0 0 0 0
## FOL.Up FOW.Down FOL.Down FOW.Close FOL.Close OTG X1G GWG ENG PSG PSA G.Bkhd
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 1 0 0 0 0 0 0 0 0
## 3 10 21 16 37 33 1 5 2 0 0 0 5
## 4 0 0 0 0 0 0 0 1 0 0 0 0
## 5 1 0 3 2 6 0 2 0 1 0 0 0
## 6 0 0 0 0 0 0 1 1 0 0 0 0
## G.Dflct G.Slap G.Snap G.Tip G.Wrap G.Wrst CBar Post Over Wide S.Bkhd S.Dflct
## 1 0 0 0 0 0 0 0 0 0 1 0 0
## 2 0 1 0 0 0 1 0 1 2 51 2 0
## 3 2 0 3 0 0 9 0 2 4 51 19 3
## 4 0 0 0 0 0 1 0 0 0 15 0 0
## 5 0 0 0 1 0 6 0 2 1 20 7 2
## 6 0 2 0 0 0 2 0 1 6 28 2 0
## S.Slap S.Snap S.Tip S.Wrap S.Wrst iPenT iPenD iPENT iPEND iPenDf NPD Min
## 1 1 0 0 0 0 0 0 0 0 0 0.0 0
## 2 49 12 0 1 79 10 6 10 5 -4 2.2 10
## 3 3 20 8 2 101 12 10 11 8 -2 -0.5 12
## 4 18 3 0 0 19 6 7 6 6 1 2.7 4
## 5 3 10 11 1 61 44 33 40 29 -11 -14.3 33
## 6 32 9 1 0 30 12 11 12 8 -1 5.4 12
## Maj Match Misc Game CF CA FF FA SF SA xGF xGA SCF SCA GF GA RBF
## 1 0 0 0 0 9 12 8 10 5 8 0.5 0.9 2 3 0 2 1
## 2 0 0 0 0 1433 1992 1038 1423 757 997 62.0 88.8 197 280 56 85 68
## 3 0 0 0 0 1301 1051 986 826 734 606 70.8 46.4 235 133 71 40 60
## 4 2 0 0 0 460 605 339 467 259 340 22.0 33.6 80 130 16 35 27
## 5 9 0 1 1 766 992 546 720 398 495 33.5 47.5 124 159 31 41 30
## 6 0 0 0 0 1356 1281 971 972 728 730 62.9 59.9 210 197 71 47 30
## RBA RSF RSA DSF DSA FOW FOL HF HA GVA TKA PENT PEND OPS DPS PS OTOI
## 1 1 0 1 1 2 4 5 1 2 1 1 1 1 0.0 -0.2 -0.2 40.03
## 2 82 79 94 147 176 949 939 749 671 284 197 104 98 -0.2 3.4 3.2 2850.59
## 3 34 76 52 136 86 739 600 340 351 168 129 56 70 3.7 1.3 5.0 2486.75
## 4 20 29 32 56 52 324 328 198 197 86 59 26 22 0.0 0.4 0.5 1074.41
## 5 37 43 53 73 90 528 490 512 422 157 126 88 68 -0.1 1.4 1.3 3459.09
## 6 56 58 85 88 141 570 667 348 707 223 168 76 60 0.6 3.7 4.3 3069.81
## Grit DAP Pace GS GS.G
## 1 1 0.0 175.7 -0.4 -0.38
## 2 290 13.3 112.5 14.1 0.18
## 3 102 6.6 114.8 36.8 0.57
## 4 130 17.5 105.1 5.9 0.20
## 5 425 8.3 99.5 21.8 0.27
## 6 150 4.5 107.4 20.8 0.26
There are over 150 columns in this dataset, meaning the correlation heatmap for the full dataset will be hard to read. Regardless, I’ll make one as a big-picture starting point. I’ll also make a pairplot.
suppressWarnings({
library(tidyr)
library(corrplot)
library(dplyr)
})
## corrplot 0.92 loaded
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
print(ncol(salary_data))
## [1] 154
print(unique(colnames(salary_data)))
## [1] "Salary" "Born" "City" "Pr.St" "Cntry"
## [6] "Nat" "Ht" "Wt" "DftYr" "DftRd"
## [11] "Ovrl" "Hand" "Last.Name" "First.Name" "Position"
## [16] "Team" "GP" "G" "A" "A1"
## [21] "A2" "PTS" "X..." "E..." "PIM"
## [26] "Shifts" "TOI" "TOIX" "TOI.GP" "TOI.GP.1"
## [31] "TOI." "IPP." "SH." "SV." "PDO"
## [36] "F.60" "A.60" "Pct." "Diff" "Diff.60"
## [41] "iCF" "iCF.1" "iFF" "iSF" "iSF.1"
## [46] "iSF.2" "ixG" "iSCF" "iRB" "iRS"
## [51] "iDS" "sDist" "sDist.1" "Pass" "iHF"
## [56] "iHF.1" "iHA" "iHDf" "iMiss" "iGVA"
## [61] "iTKA" "iBLK" "iGVA.1" "iTKA.1" "iBLK.1"
## [66] "BLK." "iFOW" "iFOL" "iFOW.1" "iFOL.1"
## [71] "FO." "X.FOT" "dzFOW" "dzFOL" "nzFOW"
## [76] "nzFOL" "ozFOW" "ozFOL" "FOW.Up" "FOL.Up"
## [81] "FOW.Down" "FOL.Down" "FOW.Close" "FOL.Close" "OTG"
## [86] "X1G" "GWG" "ENG" "PSG" "PSA"
## [91] "G.Bkhd" "G.Dflct" "G.Slap" "G.Snap" "G.Tip"
## [96] "G.Wrap" "G.Wrst" "CBar" "Post" "Over"
## [101] "Wide" "S.Bkhd" "S.Dflct" "S.Slap" "S.Snap"
## [106] "S.Tip" "S.Wrap" "S.Wrst" "iPenT" "iPenD"
## [111] "iPENT" "iPEND" "iPenDf" "NPD" "Min"
## [116] "Maj" "Match" "Misc" "Game" "CF"
## [121] "CA" "FF" "FA" "SF" "SA"
## [126] "xGF" "xGA" "SCF" "SCA" "GF"
## [131] "GA" "RBF" "RBA" "RSF" "RSA"
## [136] "DSF" "DSA" "FOW" "FOL" "HF"
## [141] "HA" "GVA" "TKA" "PENT" "PEND"
## [146] "OPS" "DPS" "PS" "OTOI" "Grit"
## [151] "DAP" "Pace" "GS" "GS.G"
salary_data_numeric <- salary_data %>% mutate_if(~ !is.numeric(.), ~ as.numeric(factor(.)))
salary_data_numeric <- salary_data_numeric %>% drop_na()
print(paste("Salary Data Rows: ", nrow(salary_data)))
## [1] "Salary Data Rows: 612"
print(paste("Salary Data Rows, n/a rows excluded: ", nrow(salary_data_numeric)))
## [1] "Salary Data Rows, n/a rows excluded: 494"
Below is focused exploration on salary.
ggplot(mapping=aes(x=Salary), data=salary_data_numeric) +
geom_histogram(aes(fill=factor(Salary)), show.legend = FALSE, bins=35) +
labs(title = "Salary Histogram")
ggplot(mapping=aes(x=Salary), data=salary_data_numeric) +
geom_density(color="darkred", show.legend=FALSE) +
labs(title = "Salary Density Plot")
ggplot(mapping=aes(y=Salary), data=salary_data_numeric) +
geom_boxplot(color="darkred", show.legend=FALSE) +
labs(title = "Salary Box Plot")
print(summary(salary_data_numeric$Salary))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 575000 800000 1025000 2487720 4000000 13800000
res <- cor(salary_data_numeric, method="kendall")
corrplot(res, tl.cex=0.3, na.label.col="snow",
method="color")
As expected, only some columns have Kendall correlations with Salary. Let’s select the columns with a correlation greater than 0.35.
greater_35_corr_vars <- colnames(res[, abs(res["Salary",]) > 0.35])
greater_35_corr_vars <- greater_35_corr_vars[!is.na(greater_35_corr_vars)]
print(greater_35_corr_vars)
## [1] "Salary" "GP" "G" "A" "A1" "A2"
## [7] "PTS" "Shifts" "TOI" "TOIX" "TOI.GP" "TOI.GP.1"
## [13] "TOI." "iCF" "iCF.1" "iFF" "iSF" "iSF.1"
## [19] "iSF.2" "ixG" "iRB" "iRS" "iDS" "Pass"
## [25] "iMiss" "iGVA" "iTKA" "iGVA.1" "iTKA.1" "GWG"
## [31] "G.Wrst" "Over" "Wide" "S.Slap" "S.Snap" "S.Wrst"
## [37] "iPenT" "iPENT" "Min" "CF" "CA" "FF"
## [43] "FA" "SF" "SA" "xGF" "xGA" "SCF"
## [49] "SCA" "GF" "GA" "RBF" "RBA" "RSF"
## [55] "RSA" "DSF" "DSA" "FOW" "FOL" "HF"
## [61] "HA" "GVA" "TKA" "PENT" "PEND" "OPS"
## [67] "DPS" "PS" "OTOI" "GS" "GS.G"
salary_dnc <- salary_data_numeric %>% select(all_of(greater_35_corr_vars))
summary(salary_dnc)
## Salary GP G A
## Min. : 575000 Min. : 1.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 800000 1st Qu.:28.00 1st Qu.: 1.000 1st Qu.: 3.00
## Median : 1025000 Median :66.00 Median : 6.000 Median :11.50
## Mean : 2487720 Mean :54.25 Mean : 8.466 Mean :14.27
## 3rd Qu.: 4000000 3rd Qu.:79.00 3rd Qu.:13.000 3rd Qu.:22.00
## Max. :13800000 Max. :82.00 Max. :44.000 Max. :63.00
## A1 A2 PTS Shifts
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 13
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 510
## Median : 6.000 Median : 5.000 Median :17.00 Median :1369
## Mean : 7.941 Mean : 6.334 Mean :22.74 Mean :1196
## 3rd Qu.:12.000 3rd Qu.: 9.000 3rd Qu.:37.00 3rd Qu.:1805
## Max. :36.000 Max. :31.000 Max. :89.00 Max. :2657
## TOI TOIX TOI.GP TOI.GP.1
## Min. : 429 Min. : 7.2 Min. : 6.25 Min. : 6.25
## 1st Qu.: 21415 1st Qu.: 353.5 1st Qu.:12.33 1st Qu.:12.31
## Median : 59199 Median : 986.7 Median :15.62 Median :15.61
## Mean : 54668 Mean : 906.9 Mean :15.55 Mean :15.54
## 3rd Qu.: 83812 3rd Qu.:1387.2 3rd Qu.:18.38 3rd Qu.:18.37
## Max. :133550 Max. :2218.9 Max. :27.15 Max. :27.12
## TOI. iCF iCF.1 iFF
## Min. :13.00 Min. : 1.0 Min. : 1.0 Min. : 1.0
## 1st Qu.:23.12 1st Qu.: 61.0 1st Qu.: 61.0 1st Qu.: 41.5
## Median :27.70 Median :163.0 Median :164.0 Median :124.0
## Mean :27.78 Mean :172.3 Mean :172.3 Mean :128.9
## 3rd Qu.:32.30 3rd Qu.:260.8 3rd Qu.:260.8 3rd Qu.:191.8
## Max. :44.90 Max. :624.0 Max. :582.0 Max. :455.0
## iSF iSF.1 iSF.2 ixG
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 29.25 1st Qu.: 29.25 1st Qu.: 29.25 1st Qu.: 2.100
## Median : 86.00 Median : 86.00 Median : 86.00 Median : 6.600
## Mean : 93.12 Mean : 93.33 Mean : 93.36 Mean : 8.371
## 3rd Qu.:140.00 3rd Qu.:139.75 3rd Qu.:140.00 3rd Qu.:13.100
## Max. :313.00 Max. :313.00 Max. :313.00 Max. :33.000
## iRB iRS iDS Pass
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 4.00 1st Qu.: 44.9
## Median : 5.000 Median : 7.000 Median :12.50 Median :127.5
## Mean : 6.609 Mean : 8.024 Mean :14.63 Mean :151.2
## 3rd Qu.:10.000 3rd Qu.:12.000 3rd Qu.:22.00 3rd Qu.:231.2
## Max. :41.000 Max. :32.000 Max. :63.00 Max. :598.3
## iMiss iGVA iTKA iGVA.1
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 8.00 1st Qu.: 5.00 1st Qu.: 8.00
## Median : 34.00 Median : 22.00 Median :17.00 Median : 22.00
## Mean : 35.93 Mean : 25.71 Mean :20.49 Mean : 25.65
## 3rd Qu.: 54.75 3rd Qu.: 39.00 3rd Qu.:32.00 3rd Qu.: 39.00
## Max. :142.00 Max. :106.00 Max. :96.00 Max. :106.00
## iTKA.1 GWG G.Wrst Over
## Min. : 0.00 Min. :0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 5.00 1st Qu.:0.000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :17.00 Median :1.000 Median : 2.000 Median : 3.000
## Mean :20.44 Mean :1.447 Mean : 4.198 Mean : 3.607
## 3rd Qu.:32.00 3rd Qu.:2.000 3rd Qu.: 7.000 3rd Qu.: 5.000
## Max. :96.00 Max. :9.000 Max. :30.000 Max. :19.000
## Wide S.Slap S.Snap S.Wrst
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 10.00 1st Qu.: 2.00 1st Qu.: 3.00 1st Qu.: 14.0
## Median : 28.50 Median : 9.00 Median :10.00 Median : 44.0
## Mean : 30.44 Mean : 16.39 Mean :14.59 Mean : 48.1
## 3rd Qu.: 46.00 3rd Qu.: 22.00 3rd Qu.:21.00 3rd Qu.: 71.0
## Max. :112.00 Max. :141.00 Max. :78.00 Max. :214.0
## iPenT iPENT Min CF
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 8.0
## 1st Qu.: 4.0 1st Qu.: 4.00 1st Qu.: 4.00 1st Qu.: 307.0
## Median :10.0 Median :10.00 Median : 9.00 Median : 877.0
## Mean :11.2 Mean :10.73 Mean :10.09 Mean : 850.8
## 3rd Qu.:16.0 3rd Qu.:16.00 3rd Qu.:15.00 3rd Qu.:1324.2
## Max. :48.0 Max. :44.00 Max. :39.00 Max. :2308.0
## CA FF FA SF
## Min. : 6.0 Min. : 5.0 Min. : 5.0 Min. : 4.0
## 1st Qu.: 336.5 1st Qu.: 224.5 1st Qu.: 244.0 1st Qu.: 163.0
## Median : 895.5 Median : 655.0 Median : 670.0 Median : 471.5
## Mean : 823.7 Mean : 634.8 Mean : 614.8 Mean : 456.9
## 3rd Qu.:1215.5 3rd Qu.: 987.8 3rd Qu.: 914.0 3rd Qu.: 714.8
## Max. :2273.0 Max. :1668.0 Max. :1765.0 Max. :1181.0
## SA xGF xGA SCF
## Min. : 2.0 Min. : 0.20 Min. : 0.40 Min. : 0.00
## 1st Qu.: 177.5 1st Qu.: 13.70 1st Qu.: 14.82 1st Qu.: 45.25
## Median : 483.0 Median : 39.75 Median : 41.75 Median :129.50
## Mean : 442.2 Mean : 41.02 Mean : 39.16 Mean :136.57
## 3rd Qu.: 656.0 3rd Qu.: 64.42 3rd Qu.: 57.77 3rd Qu.:210.00
## Max. :1276.0 Max. :111.10 Max. :108.00 Max. :419.00
## SCA GF GA RBF
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 48.0 1st Qu.: 11.25 1st Qu.: 13.00 1st Qu.: 10.00
## Median :137.0 Median : 37.00 Median : 40.00 Median : 29.00
## Mean :130.4 Mean : 40.54 Mean : 37.69 Mean : 32.55
## 3rd Qu.:195.8 3rd Qu.: 67.00 3rd Qu.: 57.00 3rd Qu.: 50.00
## Max. :344.0 Max. :120.00 Max. :100.00 Max. :110.00
## RBA RSF RSA DSF
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:11.25 1st Qu.: 14.25 1st Qu.: 15.00 1st Qu.: 25.00
## Median :30.00 Median : 39.00 Median : 40.00 Median : 70.50
## Mean :30.30 Mean : 39.71 Mean : 39.37 Mean : 72.27
## 3rd Qu.:45.00 3rd Qu.: 59.75 3rd Qu.: 58.00 3rd Qu.:109.75
## Max. :95.00 Max. :130.00 Max. :112.00 Max. :213.00
## DSA FOW FOL HF
## Min. : 0.00 Min. : 4.0 Min. : 4.0 Min. : 0.0
## 1st Qu.: 27.00 1st Qu.: 149.2 1st Qu.: 163.0 1st Qu.:157.2
## Median : 75.50 Median : 461.5 Median : 478.0 Median :343.0
## Mean : 69.67 Mean : 447.7 Mean : 444.5 Mean :327.4
## 3rd Qu.:103.00 3rd Qu.: 696.0 3rd Qu.: 674.5 3rd Qu.:464.0
## Max. :185.00 Max. :1257.0 Max. :1196.0 Max. :926.0
## HA GVA TKA PENT
## Min. : 2.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.:169.8 1st Qu.: 49.25 1st Qu.: 36.0 1st Qu.: 21.00
## Median :343.0 Median :132.50 Median :102.5 Median : 55.00
## Mean :324.2 Mean :132.64 Mean :100.0 Mean : 51.47
## 3rd Qu.:468.8 3rd Qu.:203.75 3rd Qu.:149.0 3rd Qu.: 76.00
## Max. :870.0 Max. :388.00 Max. :347.0 Max. :123.00
## PEND OPS DPS PS
## Min. : 0.00 Min. :-1.500 Min. :-0.200 Min. :-1.200
## 1st Qu.: 22.00 1st Qu.: 0.000 1st Qu.: 0.400 1st Qu.: 0.400
## Median : 53.00 Median : 0.600 Median : 1.100 Median : 2.200
## Mean : 51.03 Mean : 1.478 Mean : 1.433 Mean : 2.913
## 3rd Qu.: 76.00 3rd Qu.: 2.575 3rd Qu.: 2.000 3rd Qu.: 4.875
## Max. :137.00 Max. :10.500 Max. : 7.200 Max. :12.900
## OTOI GS GS.G
## Min. : 33.51 Min. :-3.500 Min. :-0.5900
## 1st Qu.:1072.76 1st Qu.: 4.925 1st Qu.: 0.1725
## Median :2632.04 Median :18.550 Median : 0.3300
## Mean :2149.35 Mean :24.315 Mean : 0.3673
## 3rd Qu.:3067.10 3rd Qu.:39.350 3rd Qu.: 0.5675
## Max. :3521.78 Max. :99.200 Max. : 1.2600
This is a little more legible, but still hard to decipher which variables have the strongest correlation with salary. We can take a quick look at the correlation colors present in the top row of the map. As far as salary goes, there are a lot of so-so, sky blue correlations; none seem to be a darker shade of blue. There also appears to be stronger correlations between some indicator variables, as shown by the presence of darker colored squares in the heatmap.
res_35greater <- cor(salary_dnc, method="kendall")
corrplot(res_35greater, na.label="square", na.label.col="snow", tl.cex = 0.5,
method="color")
Finally, let’s take an up-close look at variables with the strongest correlations. There are six variables with correlation scores greater than 0.5.
strongest_cols <- colnames(res[, abs(res["Salary",]) > 0.5])
strongest_cols <- strongest_cols[!is.na(strongest_cols)]
print(strongest_cols)
## [1] "Salary" "CF" "FF" "SF" "xGF" "SCF" "GF"
salary_most_cut <- salary_data_numeric %>% select(all_of(strongest_cols))
summary(salary_most_cut)
## Salary CF FF SF
## Min. : 575000 Min. : 8.0 Min. : 5.0 Min. : 4.0
## 1st Qu.: 800000 1st Qu.: 307.0 1st Qu.: 224.5 1st Qu.: 163.0
## Median : 1025000 Median : 877.0 Median : 655.0 Median : 471.5
## Mean : 2487720 Mean : 850.8 Mean : 634.8 Mean : 456.9
## 3rd Qu.: 4000000 3rd Qu.:1324.2 3rd Qu.: 987.8 3rd Qu.: 714.8
## Max. :13800000 Max. :2308.0 Max. :1668.0 Max. :1181.0
## xGF SCF GF
## Min. : 0.20 Min. : 0.00 Min. : 0.00
## 1st Qu.: 13.70 1st Qu.: 45.25 1st Qu.: 11.25
## Median : 39.75 Median :129.50 Median : 37.00
## Mean : 41.02 Mean :136.57 Mean : 40.54
## 3rd Qu.: 64.42 3rd Qu.:210.00 3rd Qu.: 67.00
## Max. :111.10 Max. :419.00 Max. :120.00
The six variables are CF (team’s shot attempts while player on ice), FF (team’s unblocked attempts while player on ice), SF (team’s shots on goal while player on ice), xGF (team’s expected goals while player on ice, by attempts by location), SCF (team’s scoring chances while player on ice), and GF (team’s goals while player on ice).
Since a lot of these variables are similar or related, it makes sense that there is a high correlation score between them.
res_cut2 <- cor(salary_most_cut, method="kendall")
corrplot(res_cut2, na.label="square", tl.cex = 1, addCoef.col="white", method="color")
Here are scatterplots for these six variables. A linear regression line
is included. These variables may or may not have a linear correlation,
but rather an exponential correlation, polynomial, simple monotonic
relationship, etc. The linear line helps compare overall shapes and
positive trends in an exploratory chapter. Included is a pairplot with
histograms on the diagonals.
for (variable in colnames(salary_most_cut)) {
print(ggplot(mapping=aes(y=salary_most_cut$Salary, x=salary_most_cut[[variable]])) +
geom_point(color="red") +
geom_smooth(method="lm", formula = y ~ x, color="darkred") +
labs(title = paste("Salary vs.", variable),
x = variable,
y = "Salary"))
}
#from R help page
#help(pairs)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "red", ...)
}
pairs(salary_most_cut, diag.panel = panel.hist)
Here are some graphs related to salary. One explores the relationship
between salary and team goals while the player was on ice. Another shows
a scatterplot between salary and goals scored by the player themself.
Both of these scatterplots separate points into two color groups – one
for left-handed players and another for right-handed. Finally, there is
a boxplot of salary for left-handed and for right-handed players.
Interestingly, there are more left-handed players in the dataset than right-handed. There doesn’t seem to be a significant trend on the scatterplots, however, it is interesting that three out of the top five salaries are given to left-handed players. The boxplots show that the median salary for both left- and right-handed players is the same (925,000), but the IQR (3,087,750), upper fence (7,500,000), and maximum (including outliers) (13,800,000) are higher for left-handed players are larger than those for right-handed players (2,582,500; 7,000,000; 12,000,000 respectively). The 75th percentile salary is nearly 13% higher for left-handed players than right-handed.
suppressWarnings({library(plotly)})
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
sd_left <- salary_data[salary_data$Hand == "L",]
sd_right <- salary_data[salary_data$Hand == "R",]
ggplot(mapping=aes(x=Hand), data=salary_data) +
geom_bar(fill = c("darkred", "red")) +
labs(title = "Dominant Hand Bar Chart")
ggplot(mapping=aes(y=Salary, x=GF, color=factor(Hand)), data=salary_data) +
geom_point(na.rm=TRUE) +
scale_color_manual(values=c("darkred", "red")) +
labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Right/Left Handed") +
geom_smooth(method="lm", formula = y ~ x, data=sd_left, color = "darkred", na.rm=TRUE) +
geom_smooth(method="lm", formula = y ~ x, data=sd_right, color="red", na.rm=TRUE)
ggplot(mapping=aes(y=Salary, x=G, color=factor(Hand)), data=salary_data) +
geom_point(na.rm=TRUE) +
scale_color_manual(values=c("darkred", "red")) +
labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Right/Left Handed") +
geom_smooth(method="lm", formula = y ~ x, data=sd_left, color = "darkred") +
geom_smooth(method="lm", formula = y ~ x, data=sd_right, color="red")
hand_plot <- ggplot(mapping=aes(y=Salary, x=Hand), data=salary_data) +
geom_boxplot(fill = c("darkred", "red"), show.legend=FALSE) +
#scale_color_manual(values = c("darkred", "red")) +
labs(title = "Boxplots of Salary by Dominant Hand") +
theme(axis.text.x = element_text(angle=90))
ggplotly(hand_plot)
Below are the same graphs, but this time, color groups are separated by country.
ggplot(mapping=aes(y=Salary, x=GF, color=factor(Cntry)), data=salary_data) +
geom_point(na.rm=TRUE) +
labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Country")
ggplot(mapping=aes(y=Salary, x=G, color=factor(Cntry)), data=salary_data) +
geom_point(na.rm=TRUE) +
labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Country")
Here is a boxplot for salary by country, and another which breaks the
prior graph into dominant hand groups. With 291 players, Canada has the
most players in the dataset by far. Canada is followed by the US with
168. Canada and the US both pay the maximum salary in the dataset
(13,800,000) to at least one of their players, and Canada pays the
highest right-handed salary (12,000,000) as well. Many countries have
higher median values for left-handed players than right-handed, although
this could be due to more left-handed players appearing in the
dataset.
cntry_bar <- ggplot(mapping=aes(x=Cntry), data=salary_data) +
geom_bar(mapping=aes(fill=factor(Cntry)), show.legend=FALSE) +
labs(title = "Distribution of Countries") +
theme(axis.text.x = element_text(angle=90))
ggplotly(cntry_bar)
cntry_sal <- ggplot(mapping=aes(y=Salary, x=Cntry), data=salary_data) +
geom_boxplot(mapping=aes(fill=factor(Cntry))) +
labs(title = "Boxplots of Salary by Country")
ggplotly(cntry_sal)
cntry_sal_hand <- ggplot(mapping=aes(x=Cntry, y=Salary), data=salary_data) +
geom_boxplot(mapping=aes(fill=factor(Cntry))) +
labs(title = "Boxplots of Salary by Country - Dominant Hand") +
theme(axis.text = element_text(angle=90)) +
facet_wrap(~Hand)
ggplotly(cntry_sal_hand)
Here’s a summary of salary in the USA.
sd_usa <- salary_data[salary_data$Cntry == "USA",]
summary(sd_usa$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 575000 667500 925000 2047664 2500000 13800000
Below is a distribution graph of US cities in the dataset and boxplots of salary by US city. Because there are many cities in the dataset, I made these plots interactive, so that zooming in and out and moving around the graph would be possible.
Ann Arbor, Madison, and Rochester have the most players in the dataset, followed by Buffalo, Minneapolis, and Pittsburgh. The majority of US cities in the dataset include data for one player.
The US player receiving the maximum salary (13,800,000) is from Buffalo, a city with a median salary of 1,587,500. Ann Arbor’s median salary matches the country’s median of 925,000, while Detroit fell short at 842,500. Madison had a median of 4,000,000, St. Paul 5,000,000, and Minneapolis 6,500,000.
plot1 <- ggplot(mapping=aes(x=City), data=sd_usa) +
geom_bar(mapping=aes(fill=factor(City)), show.legend=FALSE) +
labs(title = "Distribution of US Cities") +
theme(axis.text.x = element_text(angle=90),
axis.text = element_text(size=5))
ggplotly(plot1)
plot2 <- ggplot(mapping=aes(y=Salary, x=City), data=sd_usa) +
geom_boxplot(mapping=aes(fill=factor(City)), show.legend=FALSE) +
labs(title = "Boxplots of Salary by US City") +
theme(axis.text.x = element_text(angle=90),
axis.text = element_text(size=5))
ggplotly(plot2)
Here are some plots exploring salary by team and draft year.
ggplot(mapping=aes(x=Team), data=sd_usa) +
geom_bar(mapping=aes(fill=factor(Team)), show.legend=FALSE) +
labs(title = "Distribution of US Teams") +
theme(axis.text.x = element_text(angle=90))
ggplot(mapping=aes(y=Salary, x=Team), data=salary_data) +
geom_boxplot(mapping=aes(fill=factor(Team)), show.legend=FALSE) +
labs(title = "Boxplots of Salary by Team") +
theme(axis.text.x = element_text(angle=90),
axis.text = element_text(size=5))
ggplot(mapping=aes(x=DftYr), data=sd_usa) +
geom_bar(mapping=aes(fill=factor(DftYr)), show.legend=FALSE, na.rm=TRUE) +
labs(title = "Distribution of Draft Year in US") +
theme(axis.text.x = element_text(angle=90))
ggplot(mapping=aes(y=Salary, x=DftYr), data=salary_data) +
geom_boxplot(mapping=aes(fill=factor(DftYr)), show.legend=FALSE, na.rm=TRUE) +
labs(title = "Boxplots of Salary by Draft Year") +
theme(axis.text.x = element_text(angle=90))
In this section, I want to use random forest regression to identify the most important determinants of salary. I will use feature selection and hyperparameter tuning to increase the accuracy of the model.
setwd("C:\\Users\\eaalc\\OneDrive - Umich\\hockey_salary\\archive (11)")
y_train_data <- read.csv("test.csv")
y_test_data <- read.csv("test_salaries.csv")
#y_train_data is original csv, with all predictor columns
#y_train_test_data will include the salary column in y_test_data
y_train_test_data <- y_train_data
y_train_test_data$Salary <- y_test_data$Salary
#drop na values in all of y, so that salary is dropped with null predictor values
y_train_test_data <- y_train_test_data %>% drop_na()
#only using salary from y_train_test_data. drop na values from y_train_data
y_train_data <- y_train_data %>% drop_na()
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
suppressWarnings({library(randomForestExplainer)})
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Achieved mean absolute error (MAE) of $1,088,142 and
root mean square error (RMSE) of $1,609,394. Identified as
most important determining factors are date of birth, draft year, time
on ice divided by games played (TOI.GP), and team shots on goal while
the player was on ice (SF).
The output of this model shows a variety of different visualizations and text. The first block shows a summary of the model, prints the MAE and RMSE, and prints the importance of model features in descending order. The “Predicted vs. Actual” scatterplot shows how predicted values compare to actual values. The closer these points are to the line y=x, the better the predictions fared. “rf_results” shows how error in the model changed with number of trees used. This is useful in understanding over- and under-fitting of the random forest to the training dataset. Finally, and most importantly, the final graph prints the Top 10 most important model features as measured by IncNodePurity and %IncMSE.
set.seed(123)
salary_data <- salary_data %>% drop_na()
rf_results <- randomForest(Salary ~ ., data=salary_data, localImp=TRUE)
summary(rf_results)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 494 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 494 -none- numeric
## importance 306 -none- numeric
## importanceSD 153 -none- numeric
## localImportance 75582 -none- numeric
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 494 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
#generate predictions with y_train_data
rf_predictions <- predict(rf_results, y_train_data)
#getting salary from y_train_test_data and storing in variable name real_ytest
real_ytest <- y_train_test_data$Salary
ggplot(mapping=aes(y=real_ytest, x=rf_predictions)) +
geom_point(color="darkred") +
labs(title = "Predicted Salary vs. Actual",
x = "Predicted Salary",
y = "Actual Salary") +
geom_abline(intercept=0, slope=1, color="red")
suppressWarnings({library(Metrics)})
print(paste("MEAN ABSOLUTE ERROR:", mae(rf_predictions, real_ytest)))
## [1] "MEAN ABSOLUTE ERROR: 1088142.03037281"
print(paste("ROOT MEAN SQUARED ERROR:", rmse(rf_predictions, real_ytest)))
## [1] "ROOT MEAN SQUARED ERROR: 1609394.15926762"
plot(rf_results)
varImpPlot(rf_results, sort=TRUE, n.var = 10,
main = "Top 10 Variables used in RF Model", pch=19,
color="darkred")
print(sort(importance(rf_results)[,1], decreasing=TRUE))
## Born DftYr TOI.GP TOI. TOI.GP.1 SF
## 24.48222577 22.92848345 11.73406395 10.80076317 10.72727988 9.86906766
## xGF Ovrl GF FF SCF FOL
## 8.51776919 7.83402677 7.32961556 6.92807044 6.79411475 6.73076629
## CF OTOI GP FOW GS.G Shifts
## 6.45466561 6.36383943 6.08440565 5.90583900 5.01699464 4.91578453
## HF PS iSF.2 iFF sDist iSF.1
## 4.85718340 4.74014576 4.72451063 4.60501501 4.48470960 4.46351801
## iBLK S.Wrap OTG DftRd sDist.1 GS
## 4.43222014 4.28252046 4.25308398 4.25149411 4.19343920 4.17737833
## ixG iSF iCF.1 iBLK.1 GA GWG
## 4.13608080 4.10507228 4.06631218 4.02116744 3.99973279 3.93946130
## SCA S.Wrst A2 TOI S.Snap iCF
## 3.86058335 3.76362410 3.76254270 3.69436912 3.67939623 3.67020059
## CBar Wide Grit iMiss iRS iSCF
## 3.66607979 3.64237524 3.54208130 3.52131453 3.43988878 3.38394176
## PTS iTKA.1 ozFOW S.Tip RSF Wt
## 3.33109010 3.32519030 3.23117426 3.21365817 3.20093490 3.14135932
## iDS OPS S.Slap RBF iGVA.1 A
## 3.13877335 3.13628792 3.10939014 3.08130268 3.06144052 3.03103471
## GVA DSF FA xGA PENT TKA
## 3.02156781 2.99173075 2.99092786 2.97854212 2.91429328 2.90935273
## DPS IPP. Pass RBA CA Min
## 2.86103426 2.84384853 2.79503073 2.78882920 2.77850239 2.76795625
## iHF.1 iPenT F.60 TOIX SH. E...
## 2.71190341 2.68984836 2.60064294 2.56786992 2.51455251 2.49144599
## RSA iTKA BLK. SA Diff S.Bkhd
## 2.44407266 2.42134460 2.39243282 2.23916173 2.21108858 2.20867573
## iHA DSA DAP iHF ENG HA
## 2.14498353 2.13420356 2.07068288 2.04714856 2.04630795 2.04021212
## iGVA G.Snap Over Misc PIM iPenD
## 2.03025917 2.02108296 1.97112616 1.95798089 1.92718457 1.78309563
## Pct. Cntry Pace PDO FOL.Down iFOW.1
## 1.74884573 1.73607069 1.69019169 1.54957404 1.54792494 1.52947761
## G.Tip nzFOW iFOL dzFOW Team Game
## 1.52420729 1.50334434 1.48112138 1.37910173 1.35451700 1.24394345
## G FOW.Down Pr.St ozFOL nzFOL A1
## 1.19016434 1.17640653 1.13783830 1.00629117 0.98189746 0.95266848
## iRB PEND iPENT S.Dflct Diff.60 Ht
## 0.92155709 0.88172925 0.87681677 0.78930881 0.77622052 0.74119449
## Position iHDf iPenDf A.60 Post PSA
## 0.59831217 0.57297713 0.44108611 0.40748425 0.26687492 0.26609235
## G.Wrst Match FO. Nat iFOL.1 NPD
## 0.04820722 0.00000000 -0.05155809 -0.06414688 -0.06892691 -0.16877666
## iFOW FOL.Close X1G SV. G.Dflct Maj
## -0.20062277 -0.29874328 -0.38931997 -0.44871140 -0.46487547 -0.49517645
## iPEND X.FOT dzFOL Hand X... FOL.Up
## -0.61487557 -0.75830981 -0.82737891 -0.87631340 -0.90911246 -1.00538887
## G.Bkhd PSG FOW.Up G.Slap G.Wrap FOW.Close
## -1.12974574 -1.33542546 -1.35273615 -1.47222833 -1.62454432 -1.92858999
## Last.Name First.Name City
## -2.15034579 -2.21685057 -2.37076719
feature_importances <- c(sort(importance(rf_results)[,1], decreasing=TRUE))
While I still see columns that may be creating noise in the model (e.g.: first name), I will cut the negative importance columns.
After cutting the noisy columns, this model achieved a MAE of
$1,076,131 and RMSE of $1,603,168. The overall
set of important features remained very similar, however, the
IncNodePurity chart increased the importance level of Draft Year and
team goals scored while player on ice (GF).
print_rf_results <- function(model, ytrain_data, ytest_data) {
print(summary(model))
rf_predicted <- predict(model, ytrain_data)
ggplot(mapping=aes(y=ytest_data, x=rf_predicted)) +
geom_point(color="darkred") +
labs(title = "Predicted Salary vs. Actual",
x = "Predicted Salary",
y = "Actual Salary") +
geom_abline(intercept=0, slope=1, color="red")
print(paste("MEAN ABSOLUTE ERROR:", mae(rf_predicted, ytest_data)))
print(paste("ROOT MEAN SQUARED ERROR:", rmse(rf_predicted, ytest_data)))
plot(model)
varImpPlot(model, sort=TRUE, n.var = 10,
main = "Top 10 Variables used in RF Model", pch=19,
color="darkred")
print(sort(importance(model)[,1], decreasing=TRUE))
feature_importances <- c(sort(importance(model)[,1], decreasing=TRUE))
return(feature_importances)
}
#getting rid of features with negative importance and storing in variable
#"important_features"
important_features <- feature_importances[feature_importances > 0]
#subsetting salary_data to get all important features
X_train <- salary_data %>% select(all_of(names(important_features)))
#including salary column in the one dataset
X_train$Salary <- salary_data$Salary
#X_train has ALL predictor values as well as the Salary column. It functions as
#X_train and X_test.
#subsetting y_train_data to get all important features
y_train <- y_train_data %>% select(all_of(names(important_features)))
#storing salary column of y_train_test into y_test -- since important_features only drops
#columns, y_train_test_data should be fine as is
y_test <- y_train_test_data$Salary
rf_resultsNW <- randomForest(Salary ~ ., data=X_train, localImp=TRUE)
feature_importancesNW <- print_rf_results(rf_resultsNW, y_train, y_test)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 494 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 494 -none- numeric
## importance 254 -none- numeric
## importanceSD 127 -none- numeric
## localImportance 62738 -none- numeric
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 494 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
## [1] "MEAN ABSOLUTE ERROR: 1076131.70506213"
## [1] "ROOT MEAN SQUARED ERROR: 1603168.40741544"
## Born DftYr TOI.GP TOI.GP.1 TOI. SF
## 22.51636736 21.72454682 11.96781846 11.18372177 10.16464320 8.36165988
## xGF GF SCF FF Ovrl GP
## 7.61187451 7.60899141 7.35698957 6.99510385 6.63465414 6.61548641
## FOL CF OTOI FOW iSF.1 RSA
## 6.51583676 6.47466745 6.37114615 6.09274955 5.23686643 4.63424449
## HF F.60 SCA Shifts iSF PS
## 4.34888463 4.34517272 4.23697879 4.22948025 4.12548343 4.05097382
## Over S.Snap iSCF GVA iSF.2 ozFOW
## 4.02191939 4.02191531 4.01502704 3.97610775 3.90008350 3.83677857
## iCF iTKA.1 DSF iPenT sDist TKA
## 3.81522500 3.71710578 3.71599836 3.70959921 3.65075738 3.58744321
## GA TOIX iMiss iFF S.Slap A2
## 3.54249433 3.48309197 3.48132266 3.46495925 3.45016427 3.36769195
## S.Wrst GS iBLK RBA RBF SA
## 3.34368707 3.28533733 3.25152580 3.21542018 3.17701529 3.13802465
## GS.G iTKA RSF PIM DftRd Wide
## 3.12286370 3.12125206 3.05074890 3.03458540 3.01904691 2.96558252
## iRS PTS CBar iGVA CA OTG
## 2.95901899 2.87842256 2.84459651 2.83365242 2.82972265 2.82388651
## iGVA.1 iCF.1 S.Tip sDist.1 Misc iPenD
## 2.77337932 2.76161370 2.67549755 2.67208328 2.61461825 2.60739419
## Grit TOI iHA GWG iDS G.Snap
## 2.59633077 2.58869594 2.57847040 2.57514383 2.57194685 2.57039365
## iPENT iRB xGA iBLK.1 ixG DSA
## 2.47146217 2.43695482 2.42208076 2.37729426 2.36651813 2.34749771
## S.Wrap A PEND SH. OPS BLK.
## 2.32528795 2.31926603 2.28157041 2.27515794 2.27437233 2.23883888
## IPP. iHF.1 E... DPS Min FA
## 2.14613112 2.09510854 2.02797460 1.93275855 1.92437411 1.84374169
## Pace Pass G.Wrst PENT Pct. Position
## 1.82712928 1.80901946 1.74303679 1.72825345 1.51437362 1.45187133
## FOW.Down HA A.60 S.Bkhd Ht Post
## 1.35029181 1.34804615 1.28565954 1.22690339 1.22134462 1.20639003
## G iHF DAP Wt A1 FOL.Down
## 1.19834494 1.14859269 1.12356617 0.97089455 0.88310119 0.77119281
## nzFOW iFOW.1 Pr.St iPenDf S.Dflct Diff
## 0.66716668 0.45709188 0.34739970 0.33439499 0.32724110 0.02398969
## ozFOL G.Tip dzFOW Cntry Team nzFOL
## -0.04569959 -0.08244146 -0.09561549 -0.32614220 -0.42658866 -0.52081539
## PSA PDO Diff.60 iFOL ENG Game
## -0.66782499 -0.73859904 -0.79089063 -0.88207712 -1.39041383 -1.49393369
## iHDf
## -1.91849850
With hyperparameter tuning, this model saw slight improvements in MAE and RMSE. The overall set of important features remained close to the same.
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:Metrics':
##
## precision, recall
set.seed(123)
control <- trainControl(method="repeatedcv", number = 5, repeats = 3)
print(ncol(X_train))
## [1] 128
grid <- expand.grid(.mtry = seq(35,125, by = 15)) #default is (#predictors)/3, which would be 127/3, which is just over 42
trained_rf <- train(
Salary~.,
data=X_train,
trControl=control,
tuneGrid = grid,
method = "rf",
tuneLength = 5)
print(trained_rf)
## Random Forest
##
## 494 samples
## 127 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 397, 395, 394, 395, 395, 394, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 35 1591985 0.5508171 1082468.8
## 50 1563445 0.5677225 1060544.8
## 65 1545657 0.5781056 1046233.6
## 80 1528749 0.5877083 1032074.3
## 95 1513726 0.5962288 1023065.1
## 110 1500858 0.6034384 1014024.0
## 125 1482960 0.6142353 997661.3
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 125.
print(trained_rf$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 125
##
## Mean of squared residuals: 2.117251e+12
## % Var explained: 61.86
finalModel <- randomForest(Salary ~., data = X_train, localImp = TRUE, mtry = 85, ntree = 500)
print_rf_results(finalModel, y_train, y_test)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 494 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 494 -none- numeric
## importance 254 -none- numeric
## importanceSD 127 -none- numeric
## localImportance 62738 -none- numeric
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 494 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
## [1] "MEAN ABSOLUTE ERROR: 1083522.6124269"
## [1] "ROOT MEAN SQUARED ERROR: 1595992.86533047"
## Born DftYr TOI. TOI.GP TOI.GP.1 GF
## 26.764393367 24.008220239 12.804791398 12.414052458 11.421630101 10.051454961
## Ovrl SF xGF GP OTOI FF
## 9.351391868 9.008954733 7.133221487 6.483796266 6.355644812 6.297115912
## FOW FOL OTG GS iSF SCF
## 5.788700891 5.769495779 5.415682952 5.146832115 5.130163367 5.032195669
## CBar sDist S.Snap S.Slap S.Wrst iSCF
## 4.748727990 4.509959753 4.461433069 4.313212598 4.277235774 4.248517713
## Shifts CF GA iCF iSF.1 GVA
## 4.215448667 4.180624326 4.139789561 4.063415474 3.992185943 3.974876454
## RSA DSA DftRd HF SCA iMiss
## 3.955751691 3.834853297 3.813739363 3.810778951 3.775163741 3.690408899
## A2 Wide GS.G PS iRB ixG
## 3.676290786 3.645108607 3.633305438 3.610214150 3.595268660 3.480885471
## S.Tip TOI GWG Pace RSF iBLK.1
## 3.346414030 3.329585036 3.221806341 3.217713209 3.174221855 3.025100847
## TKA SH. iDS xGA PTS RBF
## 3.018045764 2.974892868 2.971259194 2.835857332 2.741014454 2.735896100
## ozFOW Diff iHF.1 Over FOW.Down DSF
## 2.722534258 2.685076261 2.666567624 2.664120804 2.579676300 2.561030820
## iCF.1 PENT iBLK PEND iHA DPS
## 2.557349361 2.528110773 2.526564711 2.508840054 2.449815740 2.435116068
## iPENT iPenD F.60 HA Pct. A
## 2.415971545 2.346187236 2.274454030 2.238342994 2.234317528 2.169024021
## iTKA SA Pass FA BLK. sDist.1
## 2.125826173 2.120084133 2.072541943 1.997675758 1.957988996 1.926816121
## iGVA.1 IPP. G.Snap CA Grit iFF
## 1.880160904 1.784961551 1.709598294 1.707276479 1.684499804 1.649876053
## PIM Min OPS A.60 PSA dzFOW
## 1.623735388 1.607664868 1.570998803 1.485317615 1.454454279 1.438152140
## iGVA iRS E... FOL.Down RBA iPenT
## 1.394277913 1.377221218 1.373461469 1.350924783 1.329429951 1.295351023
## iSF.2 Post TOIX Diff.60 G.Tip iTKA.1
## 1.275978881 1.267433739 1.267174176 1.250864070 1.071081145 0.884188406
## Misc S.Dflct ENG G.Wrst A1 Wt
## 0.829602185 0.777813781 0.776432704 0.621970052 0.611672853 0.577726108
## G S.Wrap ozFOL nzFOL S.Bkhd Ht
## 0.502518936 0.282796356 -0.005267235 -0.118169065 -0.129360937 -0.143281981
## DAP PDO iPenDf Cntry Pr.St Game
## -0.200812551 -0.234759389 -0.320382475 -0.470494862 -0.682933720 -0.696145437
## iHF iFOL nzFOW iFOW.1 Team iHDf
## -0.742009387 -0.823680443 -1.075200282 -1.240703523 -1.270223270 -1.438626597
## Position
## -1.503646733
## Born DftYr TOI. TOI.GP TOI.GP.1 GF
## 26.764393367 24.008220239 12.804791398 12.414052458 11.421630101 10.051454961
## Ovrl SF xGF GP OTOI FF
## 9.351391868 9.008954733 7.133221487 6.483796266 6.355644812 6.297115912
## FOW FOL OTG GS iSF SCF
## 5.788700891 5.769495779 5.415682952 5.146832115 5.130163367 5.032195669
## CBar sDist S.Snap S.Slap S.Wrst iSCF
## 4.748727990 4.509959753 4.461433069 4.313212598 4.277235774 4.248517713
## Shifts CF GA iCF iSF.1 GVA
## 4.215448667 4.180624326 4.139789561 4.063415474 3.992185943 3.974876454
## RSA DSA DftRd HF SCA iMiss
## 3.955751691 3.834853297 3.813739363 3.810778951 3.775163741 3.690408899
## A2 Wide GS.G PS iRB ixG
## 3.676290786 3.645108607 3.633305438 3.610214150 3.595268660 3.480885471
## S.Tip TOI GWG Pace RSF iBLK.1
## 3.346414030 3.329585036 3.221806341 3.217713209 3.174221855 3.025100847
## TKA SH. iDS xGA PTS RBF
## 3.018045764 2.974892868 2.971259194 2.835857332 2.741014454 2.735896100
## ozFOW Diff iHF.1 Over FOW.Down DSF
## 2.722534258 2.685076261 2.666567624 2.664120804 2.579676300 2.561030820
## iCF.1 PENT iBLK PEND iHA DPS
## 2.557349361 2.528110773 2.526564711 2.508840054 2.449815740 2.435116068
## iPENT iPenD F.60 HA Pct. A
## 2.415971545 2.346187236 2.274454030 2.238342994 2.234317528 2.169024021
## iTKA SA Pass FA BLK. sDist.1
## 2.125826173 2.120084133 2.072541943 1.997675758 1.957988996 1.926816121
## iGVA.1 IPP. G.Snap CA Grit iFF
## 1.880160904 1.784961551 1.709598294 1.707276479 1.684499804 1.649876053
## PIM Min OPS A.60 PSA dzFOW
## 1.623735388 1.607664868 1.570998803 1.485317615 1.454454279 1.438152140
## iGVA iRS E... FOL.Down RBA iPenT
## 1.394277913 1.377221218 1.373461469 1.350924783 1.329429951 1.295351023
## iSF.2 Post TOIX Diff.60 G.Tip iTKA.1
## 1.275978881 1.267433739 1.267174176 1.250864070 1.071081145 0.884188406
## Misc S.Dflct ENG G.Wrst A1 Wt
## 0.829602185 0.777813781 0.776432704 0.621970052 0.611672853 0.577726108
## G S.Wrap ozFOL nzFOL S.Bkhd Ht
## 0.502518936 0.282796356 -0.005267235 -0.118169065 -0.129360937 -0.143281981
## DAP PDO iPenDf Cntry Pr.St Game
## -0.200812551 -0.234759389 -0.320382475 -0.470494862 -0.682933720 -0.696145437
## iHF iFOL nzFOW iFOW.1 Team iHDf
## -0.742009387 -0.823680443 -1.075200282 -1.240703523 -1.270223270 -1.438626597
## Position
## -1.503646733